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Abstract 

Replicability analysis aims to identify the findings that replicated across 
independent studies that examine the same features. We provide powerful 
novel replicability analysis procedures for two studies for FWER and for FDR 
control on the replicability claims. The suggested procedures first select the 
promising features from each study solely based on that study, and then test for 
replicability only the features that were selected in both studies. We incorporate 
the plug-in estimates of the fraction of null hypotheses in one study among the 
selected hypotheses by the other study. Since the fraction of nulls in one study 
among the selected features from the other study is typically small, the power 
gain can be remarkable. We provide theoretical guarantees for the control of the 
appropriate error rates, as well as simulations that demonstrate the excellent 
power properties of the suggested procedures. We demonstrate the usefulness of 
our procedures on real data examples from two application fields: behavioural 
genetics and microarray studies. 


1 Introduction 

In modern science, it is often the case that each study screens many features. Iden¬ 
tifying which of the many features screened have replicated findings, and the extent 
of replicability for these features, is of great interest. For example, the association 
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of single nucleotide polymorphisms (SNPs) with a phenotype is typically considered 
a scientific finding only if it has been discovered in independent studies, that exam¬ 
ine the same associations with phenotype, but on different cohorts, with different 


environmental exposures (Heller et ah, 2014a) 


Two studies that examine the same problem may only partially agree on which fea¬ 
tures have signal. For example, in the two microarray studies discussed in Section [872 


among the 22283 probes examined in each study we estimated that 29% have signal in 
both studies, but 32% have signal in exactly one of the studies. Possible explanations 
for having signal in only one of the studies include bias (e.g., in the cohorts selected 
or in the laboratory process), and the fact that the null hypotheses tested may be too 
specific (e.g., to the specific cohorts that were subject to specific exposures in a each 
study). In a typical meta-analysis, all the features with signal in at least one of the 
studies are of interest (estimated to be 61% of the probes in our example). However, 
the subset of the potential meta-analysis findings which have signal in both studies 
may be of particular interest, for both verifiability and generalizability of the results. 
Replicability analysis targets this subset, and aims to identify the features with signal 
in both studies (estimated to be 29% of the probes in our example). 

Formal statistical methods for assessing replicability, when each study examines many 
features, were developed only recently. An empirical Bayes approach for two studies 


was suggested by Li et al. (2014), and for at least two studies by Heller and Yekuticli 


(2014). The accuracy of the empirical Bayes analysis relies on the ability to estimate 


well the unknown parameters, and thus it may not be suitable for applications with 
a small number of features and non-local dependency in the measurements across 


features. A frequentist approach was suggested in Benjamini et al. (2009), which sug¬ 


gested applying the Benjamini-Hochberg (BH) procedure (Benjamini and Hochberg 


1995) to the maximum of the two studies p- values. However, Heller and Yekutieli 


(2014) and Bogomolov and Heller (2013) noted that the power of this procedure may 
be low when there is nothing to discover in most features. |Bogomolov and Heller 


(2013) suggested instead applying twice their procedures for establishing replicability 


from a primary to a follow-up study, where each time one of the studies takes on the 
role of a primary study and the other the role of the follow-up study. 

In this work we suggest novel procedures for establishing replicability across two 
studies, which are especially useful in modern applications when the fraction of fea¬ 
tures with signal is small (e.g., the approaches of Bogomolov and Heller (2013) and 


Benjamini et al. (2009) will be less powerful whenever the fraction of features with 


signal is smaller than half). The advantage of our procedures over previous ones is 
due to two main factors. First, these procedures are based on our novel approach, 
which selects the promising features from each study solely based on that study, and 
then tests for replicability only the features that were selected in both studies. This 
approach focuses attention on the promising features, and has the added advantage 
of reducing the number of features that need to be accounted for in the subsequent 
replicability analysis. Note that since the selection is only a first step, it may be 
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much more liberal than that made by a multiple testing procedure, and can include 
all features that seem interesting to the investigator (see Remark 3.2 for a discussion 


of selection by multiple testing). Second, we incorporate in our procedures estimates 
of the fraction of nulls in one study among the features selected in the other study. 
We show that exploiting these estimates can lead to far more replicability claims 
while still controlling the relevant error measures. For single studies, multiple testing 
procedures that incorporate estimates of the fraction of nulls, i.e. the fraction of fea¬ 


tures in which there is nothing to discover, are called adaptive procedures (Benjamini 


and Hochberg, 

2000 

) or plug-in procedures ( 

Finner and Gontsharuk, 

2009 


One of 

the simplest, and still very popular, estimators is the plug-in estimator, reviewed in 
Section [ET The smaller is the fraction of nulls, the higher is the power gain due to 
the use of the plug-in estimator. In this work, there is a unique opportunity for using 
adaptivity: even if the fraction of nulls in each individual study is close to one, the 
fraction of nulls in study one (two) among the selected features based on study two 
(one) may be small since the selected features are likely to contain mostly features 
with false null hypotheses in both studies. In the data examples we consider, the 
fraction of nulls in one study among the selected in the other study was lower than 
50%, and we show in simulations that the power gain from adaptivity can be large. 

Our procedures also report the strength of the evidence towards replicability by a 


number for each outcome, the r-value for replicability, introduced in Helle r et ah 
(2014a) and reviewed in Section[2j The remaining of the paper is organized as follows. 
In Section [2] we describe the formal mathematical framework. We introduce our new 
non-adaptive FWER- and FDR-replicability analysis procedures in Section [3j and 
their adaptive variants in Section |4j For simplicity, we shall present the notation, 
procedures, and theoretical results for one-sided hypotheses tests in Sections |2]j4j In 
Section [5] we present the necessary modifications for two-sided hypotheses, which turn 
out to be minimal. In Section [6] we suggest selection rules with optimal properties. In 
Sections [7] and [8] we present a simulation study and real data examples, respectively. 
Conclusions are given in Section [9j Lengthy proofs of theoretical results are in the 
Appendix. 


1.1 Review of the plug-in estimator for estimating the frac¬ 
tion of nulls 


Let 7r 0 be the fraction of null hypotheses. Schweder and Spjotvoll (1982) proposed 
estimating this fraction by ^ , where m is the number of features and 

A G (0,1). The slightly inflated plug-in estimator 


7T 0 = 


— values > A} + 1 
m(l — A) 


has been incorporated into multiple testing procedures in recent years. For inde¬ 


pendent p- values, Storey (2003) proved that applying the BH procedure with md 0 
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instead of m controls the FDR, and Finner and Gontsharuk (2009) proved that ap¬ 


plying Bonferroni with rriTTo instead of m controls the FWER. 

Adaptive procedures in single studies have larger power gain over non-adaptive pro¬ 
cedures when the fraction of nulls, 7To, is small. This is so because these procedures 
essentially apply the original procedure at level l/ffo times the nominal level to achieve 


FDR or FWER control at the nominal level. Finner and Gontsharuk (2009) showed 


in simulations that the power gain of using mfro instead of m can be small when the 
fraction of nulls is 60%, but large when the fraction of nulls is 20%. 

The plug-in estimator is typically less conservative (smaller) the larger A is. This 


follows from Lemma 1 in Dickhaus et ah (2012), that showed that for a single study 


the estimator is biased upwards, and that the bias is a decreasing function of A if 
the cumulative distribution function (CDF) of the non-null p-values is concave (if the 
p-values are based on a test statistic whose density is eventually strictly decreasing, 


then concavity will hold, at least for small A). Benjamini et al. (2006) noted that the 
FDR of the BH procedure which incorporates the plug-in estimator with A = 0.5 is 
sensitive to deviations from the assumption of independence, and it may be inflated 
above the nominal level under dependency. Blanchard and Roquain (2009) further 


noted that although under equi-correlation among the test statistics using the plug¬ 
in estimators does not control the FDR with A = 0.5, it does control the FDR with 
A = q/(q+l+l/m) ~ q. Blanchard and Roquain (2009) compared in simulations with 


dependent test statistics the adaptive BH procedure using various estimators of the 
fraction of nulls for single studies, including the plug-in estimator with A G {0.05, 0.5}. 
Their conclusion was that the plug-in estimator with A = 0.05 was superior to all 
other estimators considered, since it had the highest power overall without inflating 
the FDR above the 0.05 nominal level. 


2 Notation, goal, and review for replicability anal¬ 
ysis 


Consider a family of m features examined in two independent studies. The effect of 
feature j G {1,..., m} in study i G {1,2} is Let Hij be the hypothesis indicator, 
so = 0 if 6ij = 6ij, and H tJ = 1 if Q VJ > 6R. 

Let Hj = (Hij,H 2 j). The set of possible states of Hj is % — {h = (hi,h 2 ) : 
(0, 0), (1, 0), (0,1), (1,1)}. The goal of inference is to discover as many features as 
possible with Hj ^ H°, where H° C H. For replicability analysis, H° = H° NR = 
{(0,0), (0,1), (1,0)}. For a typical met a-analysis, H (] = {(0,0)}, and the number of 
features with state (0, 0) can be much smaller than the number of features with states 
in H% r , see the example in Section 

We aim to discover as many features with Hj = (1,1) as possible, i.e., true replicability 
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claims, while controlling for false replicability claims, i.e. replicability claims for 
features with Hj G 'H%r- Let 1Z be the set of indices of features with replicability 
claims. The FWER and FDR for replicability analysis are defined as follows: 


FWER = Pr (\n n {j : H 3 G U° NR }\ > o) , 


FDR = E 


: Hj 6 U° NR } \ 

max(|P|, 1) 


where E(-) is the expectation. 

Our novel procedures first select promising features from each study solely based 
on the data of that study. Let Si be the index set of features selected in study i, 
for i G {1,2}, and let Si = |<S, be their number. The procedures proceed towards 
making replicability claims only on the index set of features which are selected in 
both studies, i.e. <Si fl S 2 . For example, selected sets may include all (or a subset of) 
features with two-sided p-values below a. See Remark T2 for a discussion about the 
selection process. 


Let Pi = (Pji, ..., P im ) be the m-dimensional random vector of p-values of study 
i G {1,2}, and p t = (p*i, ... ,p im ) be its realization. We shall assume the following 
condition is satisfied for (P\ , P 2 ): 


Definition 2.1. The studies satisfy the null independence-across-studies condition if 
for all j with Hj G TL% R , if H {] = 0 then Py is independent of P 2; and if H 2 j = 0 
then P 2 j is independent of Pi. 


This condition is clearly satisfied if the two studies are independent, but it also allows 
the pairs (Pij, Pij) to be dependent for Hj f R° nr - Note moreover that this condition 
does not pose any restriction on the joint distribution of p- values within each study. 


We shall assess the evidence towards replicability by a quantity we call the r-value, 


introduced in Heller et al. (2014a), which is the adjusted p-value for replicability 


analysis. In a single study, the adjusted p- value of a feature is the smallest level (of 
FWER or FDR) at which it is discovered (Wright, 1992). Similarly, for feature j, 


the r-value is the smallest level (of FWER or FDR) at which feature j is declared 
replicable. 

The simplest example of p- value adjustment for a single study i G {1, 2} is Bonferroni, 
with adjusted p-values Pif~ B ° n = ,j = 1 ,...,m. The BH adjusted p- values 


build upon the Bonferroni adjusted p- values (Reiner et al. 
p-value for feature j is defined to be 


2003). The BH adjusted 


min 

fi adj — Bonfadj — Bonf 

i K: P ik ^ Pij 


adj—Bonf 

Pik 


_ 7 / adj—Bonf\ ’ 

, rank[p ik J J ) 


where rank{p c ‘f Bon f j s the rank of the Bonferroni adjusted p-value for feature 
k, with maximum rank for ties. For two studies, we can for example define the 
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Bonferroni-on-max r-values to be r ^ on f~ max — rnmax(pij,p 2 j),j = 1 The 

BH-on-max r-values build upon the Bonferroni-on-max r-values exactly as in single 
studies. The BH-on-max r-value for feature j is defined to be 


r 


Bonf—max 


(7 Bonf — max 

{K: r k 


mm 




Bonf — max 




} rank{r 


Bonf—max\ 5 


where rank(r^ on ^~ max ) is the rank of the Bonferroni-on-max adjusted p -value for fea¬ 
ture k , with maximum rank for ties. Claiming as replicable the findings of all features 
with BH-on-max r-values at most a is equivalent to considering as replicability claims 
the discoveries from applying the BH procedure at level a on the maximum of the 
two studies p-values, suggested in Benjamini et al. (2009). In this work we introduce 
r-values that are typically much smaller than the above-mentioned r-values for fea¬ 
tures selected in both studies, with the same theoretical guarantees upon rejection at 
level a, and thus preferred for replicability analysis of two studies. 


3 Replicability among the selected in each of two 
studies 


Let c € (0,1), with default value c = 0.5, be the fraction of the significance level 
“dedicated” to study one. The Bonferroni r-values are 


Bonf 

r j 


max 


^ S 2 pij Sip 2j ^ 


j e Si n S 2 . 


The FDR r-values build upon the Bonferroni r-values and are necessarily smaller: 

Bonf 


r FDR 

j 


nun 


{i:rf ie*SiruS 2 } rank(rf 


Bonf\ 


j e Si n s 2 


(3.1) 


where rank{rf on is the rank of the Bonferroni r-value for feature i G <Si fl S 2 , with 
maximum rank for ties. 


Declaring as replicated all features with Bonferroni r-values at most a controls the 
FWER at level a, and declaring as replicated all features with FDR r-values at most 


a controls the FDR at level a under independence, see Section 3.1 


The relation between the Bonferroni and FDR r-values is similar to that of the ad¬ 
justed Bonferroni and adjusted BH p -values described in Section [2} For the features 
selected in both studies, if less than half of the features are selected by each study, 
it is easy to show that FDR (Bonferroni) r-values given above, using c = 0.5, will be 
smaller than (1) the BH-on-max (Bonferroni-on-max) r-values described in Section 
[2j and (2) the r-values that correspond to the FDR-controlling symmetric procedure 
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Bogomolov and Heller (2013), which will be typically smaller than BH-on-max 


r-values but larger than FDR r-values in (3.1) due to taking into account the multi¬ 
plicity of all features considered. 


3.1 Theoretical properties 

Let a G (0,1) be the level of control desired, e.g. a = 0.05. Let o,\ = ca be the 
fraction of a for study one, e.g. ot\ = a/ 2 . 

The procedure that makes replicability claims for features with Bonferroni r-values 
at most a is a special case of the following more general procedure. 

Procedure 3.1. FWER-replicability analysis on the selected features Si fl S 2 : 


1. Apply a FWER controlling procedure at level ot\ on the set {pij, j G 1 S 2 }, and 
let IZi be the set of indices of discovered features. Similarly, apply a FWER 
controlling procedure at level a — ai on the set {p 2 j,j G du}, and let 1Z 2 be the 
set of indices of discovered features. 

2. The set of indices of features with replicability claims is IZi D 7 Z 2 - 


When using Bonferroni in Procedure 3.1[ feature j G 1 S 1 fl S 2 is among the discoveries 
if and only if (pij,p 2 j) < ( 01 / 62 , (a — ai)/Si). Therefore, claiming replicability for 


all features with Bonferroni r-values at most a is equivalent to Procedure 3T using 
Bonferroni. 


Theorem 3.1. If the null independence-across-studies condition is satisfied, then 


Procedure 3.1 controls the FWER for replicability analysis at level a. 


Proof. Let V\ = \Hi fl {j : H\j = 0} | and V 2 = \R, 2 Fl {j : H 2 j = 0} | be the number of 
true null hypotheses rejected in study one and in study two, respectively, by Procedure 
3.1| Then the FWER for replicability analysis is 


E(m + V 2 > 0]) < E(E( I[w > 0]|P 2 )) + E(E(I[V 2 > 0]|Pi)). 


Clearly, E( I[Vi > 0]|P 2 ) < oti since P\j is independent of P 2 for all j with H\j = 0 , 
and a FWER controlling procedure is applied on {pij,j G ^ 2 }. Similarly, E(fL\V 2 > 
0]|Pi) < a — ot\. It thus follows that the FWER for replicability analysis is at most 

a. 33 


The procedure that rejects the features with FDR r-values at most a is equivalent to 
the following procedure, see Lemma |B.l for a proof. 


Procedure 3.2. FDR-replicability analysis on the selected features Si fl S 2 •' 
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1. Let 


R = max { r : I 

je-SinS 2 


. rat r(a - a x ) 
\Plj,P2j) < ( ^- 


2. The set of indices with replicability claims is 

Ra i R(a — an) 


K = {] : {pij,P2j) < 


S 2 ’ 51 


= r 


,3 G 5i fi 5 2 }. 


This procedure controls the FDR for replicability analysis at level a as long as the 
selection rules by which the sets 5i and S 2 are selected are stable (this is a very 


lenient requirement, see Bogomolov and Heller (2013) for examples). 


Definition 3.1. (Bogomolov and Heller, 2013) A stable selection rule satisfies the 
following condition: for any selected feature, changing its p-value so that the feature is 
still selected while all other p-values are held fixed, will not change the set of selected 
features. 

Theorem 3.2. If the null independence-across-studies condition is satisfied, and the 


selection rules by which the sets Si and S 2 are selected are stable, then Procedure (L2 
controls the FDR for replicability analysis at level a if one of the following items is 
satisfied: 


(1) The p-values from true null hypotheses within each study are each independent 
of all other p-values. 


(2) Arbi trary dependence among the p-values within each study, when Si in Proce¬ 


dure 


3.2 


is 


replaced by Si Ylk=i !/&> f or * = 1)2. 


See Appendix [B] for a proof. 


Remark 3.1. The FDR r-values for the procedure that is valid for arbitrary depen¬ 
dence, denoted by rJ DR ,j e «5i D «S 2 , are computed using formula (3.1) where the 
Bonferroni r-values rf on ^ are replaced by 


(E£i VtfSiPij (£?= i l/i)Sip 2j 


Si 


rj = max 


1 — c 


j g n S‘2- 


(3.2) 


Remark 3.2. An intuitive approach towards replicability may be to apply a multiple 
testing procedure on each study separately, with discovery sets V i and V 2 in study 
one and two, respectively, and then claim replicability on the set T>i flP 2 . However, 
even if the multiple testing procedure has guaranteed FDR control at level a, it is 
easy to construct examples where the expected fraction of false replicability claims in 
T >i n T >2 will be far larger than a. An extreme example is the following: half of the 
features have Hj = (1,0), the remaining half have Hj = (0,1) , and the signal is 


















very strong. Then in study one all features with Hj = (1, 0) and few features with 
Hj = (0,1) will be discovered, and in study two all features with Hj = (0,1) and few 
features with Hj = (1, 0) will be discovered, resulting in a non-empty set T>iC\T >2 which 
contains only false replicability claims. Interestingly, if the multiple testing procedure 
is Bonferroni at level a, then the FWER on replicability claims of the set T>ir\V 2 is at 
most a. However, this procedure (which can be viewed as Bonferroni on the maximum 
of the two study p-values) can be far more conservative than our suggested Bonferroni- 
type procedure. If we select in each study separately all features with p-values below 
a/2, resulting in selection sets Si and S 2 in study one and two, respectively, then 
using our Bonferroni-type procedure we claim replicability for features with (p\j,p 2 j) < 
(a/(2S 2 ), a/(2S\)). Our discovery thresholds, (a/(2S 2 ),a/(2Si)), are both larger than 
a/m as long as the number of features selected by each study is less than half, and 
thus can lead to more replicability claims with FWER control at level a. 


4 Incorporating the plug-in estimates 


When the non-null hypotheses are mostly non-null in both studies, i.e., there are 
more features with Hj = (1,1) than with Hj = (1,0) or Hj = (0,1), then the 
non-adaptive procedures for replicability analysis may be over conservative. The 
conservativeness follows from the fact that the fraction of null hypotheses in one study 
among the selected in the other study is small. The set is more likely to contain 
hypotheses with Hj E {(1,0), (1,1)} than hypotheses with Hj E {(0,0), (0,1)}, and 
therefore the fraction of true null hypotheses in study two among the selected in 
study one, i.e., X/jes^l H 2 j)/Si, may be much smaller than one (especially if there 
are more features with Hj = (1,1) than with Hj = (1,0)). Similarly, the fraction 
of true null hypotheses in study one among the selected based on study two, i.e., 
E je52 (l — H\j)/S 2 , may be much smaller than one. 


The non-adaptive procedures for replicability analysis in Section [3] control the error- 
rates at levels that are conservative by the expectation of these fractions. Procedures 


3.1 using Bonferroni and 3.2 control the FWER and FDR, respectively, at level which 


is at most 


a\E 


(Ejtsjmwp 

V S 2 


(cy — ot\^E 


(EksP-Hv) 

V Si 


which can be much smaller than a if the above expectations are far smaller than 
one. This upper bound follows for FWER since an upper bound for the FWER of a 
Bonferroni procedure is the desired level times the fraction of null hypotheses in the 


family tested, and for the FDR from the proof of item 1 of Theorem 3.2 


We therefore suggest adaptive variants, that first estimate the expected fractions 
of true null hypotheses among the selected. We use the slightly inflated plug-in 


9 







estimators (reviewed in Section 1.1): 


7T 0 = 


i + Ei(^« > V 

S 2 , a(1 - A) 


- II _ 
7Tn — 


1 + E jeSl . x 1 ( p v > A) 


Si,x(l - A) 


(4.1) 


where 0 < A < 1 is a fixed parameter, S it \ = Si fl {j : < A}, and S';, \ = |S;,a|, for 

i — 1,2. Although 7 Tq and k q 7 depend on the tuning parameter A, we suppress the 
dependence of the estimates on A for ease of notation. 


The adaptive Bonferroni r -values for fixed c = a±/a are: 


adapt B on f 

r j 


max 


TTo-SVaPIj 

5 

c 


KoSi,xP2j \ 
1 -C J’ 


j £ n 52 ,a- 


As in Section [3j the adaptive FDR r-values build upon the adaptive Bonferroni r- 
values: 


1 adaptFDR 


T, 


adaptBonf 


mm 


{i :r ? doptSo " / >r'? d “ I ’‘ Bon/ ! ie5 liAn 52,A} rank(r°; daptBonf ) ’ 


J £ *5i,a n S2,\ 


where rank(rf daptBon *) is the rank of the adaptive Bonferroni r -value for feature 
i £ S i t A (~l 5 2 ) a, with maximum rank for ties. Declaring as replicated all features with 
adaptive Bonferroni/FDR r-values at most a controls the FWER/FDR for replica¬ 
bility analysis at level a under independence, see Section |4~Tj 

The non-adaptive procedures in Section [3] only require as input {p\ 3 : j 6 dj and 
{p 2 j : j £ S 2 }. However, if {pi 3 : j £ Si U and {p 2 j : j G 5i U 5 2 } are available, 
then the adaptive procedures with A = a are attractive alternatives with better power, 
as demonstrated in our simulations detailed in Section [71 


4.1 Theoretical properties 


The following Procedure 4T is equivalent to declaring as replicated all features with 
Bonferroni adaptive r-values at most a. 


Procedure 4.1. Adaptive-Bonferroni-replicability analysis on {{pij,P 2 j) : j G 5 1 U 
£ 2 } with input parameter X: 


1. Compute and S 2 ,\- 

2. Let Hi = {j £ Si ,a : Pij < ai/(S 2 ,\k I 0 )} and 1Z 2 = {j £ S 2 ,\ : P 2 j < (a - 
«i)/(*S' 1 ,a7To / )} be the sets of indices of features discovered in studies one and 
two, respectively. 

3. The set of indices of features with replicability claims is Hi fl H 2 - 
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Theorem 4.1. If the null independence-across-studies condition is satisfied, and the 
p-values from true null hypotheses within each study are jointly independent, then 
Procedure \4.1\ controls the FWER for replicability analysis at level a. 


Proof. It is enough to prove that E(I\V\ > 0]|P 2 ) < ay and P(I[V 2 > 0]|Pi) < 
a —ay, as we showed in the proof of Theorem 3.1| These inequalities essentially follow 
from the fact that the Bonferroni plug-in procedure controls the FWER (Finner and 


Gontsharuk 2009). We will only show that E(I\V\ > 0] |P 2 ) < ay, since the proof 


that E( 1[V 2 > 0]|Pi) < a — a\ is similar. We shall use the fact that 


> 


i + E je52 , A (i-^)i(^>A) 
^ 2 ,a(1 - A) 


(4.2) 


E{ I[W > 0]|P 2 ) = Pr J2 (! - e s i* p u < «i/(^2,a7To)] > 0|P 2 


J65 2 „ 


< (1 - H u )Pr(P u < mm(X,a 1 /S 2 ,xn I 0 )\P 2 

*e5 2 ,A 

/ / 


< E (' - H ») Pl 

*ScS 2i A 


E (! - w.)Pi 

*£*$2 , a 


Pm < min 


\ 


A, 


Ol\ 


i+E jSS2:A (i-^idi(^>A) 

Wa 


/ 


P\i Ei min 


A, 


OL i 


V 


i+E j6 5 2 , AJ ^(i-^ii)FW>A) 
1—A 




\ \ 

P 2 


(4.3) 


(4.4) 


7 


7 


< Y. (i - H ^ E v 


'l + E je52 , A ,^(l-^)I(^>A)' 


*SiS 2i a 


1-A 


< (i * J2 ( x - «i- 

*ScS 2iA JSiS 2i a 


(4.5) 

(4.6) 


Inequality (4.3) follows from the Bonferroni inequality, and inequality (4.4) follows 
from (4.2). Inequality (4.5) follows from the facts that for i with Hu = 0, (a) Pu 
is independent of all null p-values from study one and from all p-values from study 
two, and (b) Pr (P u < x) < x for all x G [0,1]. Inequality (4.6) follows by applying 
Lemma 1 in Benjamini et al. (2006), which states that if Y ~ B(k — 1 ,p) then 
E(l/(Y + 1)) < 1/(A;p), to Y = Xq e s 2 A yy 4 (l - Hij)l(Pij > A), which is distributed 
A j&fi ~ ), 1 — A) if the null p-values within each study are uniformly 

distributed. It is easy to show, using similar arguments, that inequality (4.6) remains 
true when the null p-values are stochastically larger than uniform. O 
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Declaring as replicated all features with adaptive FDR r -values at most a is equivalent 
to Procedure 3.2 where Si and S 2 are replaced by Si^tt^ 1 and S^attq respectively, and 
fl S 2 is replaced by 5 i,a H «S 2) a, see Lemma B.l for a proof. 


Theorem 4.2. If the null independence-across-studies condition holds, the p-values 
corresponding to true null hypotheses are each independent of all the other p-values, 
and the selection rules by which the sets Si and S 2 are selected are stable, then declar¬ 
ing as replicated all features with adaptive FDR r-values at most a controls the FDR 
for replicability analysis at level a. 


See Appendix [B] for a proof. 


5 Directional replicability analysis for two-sided 
alternatives 


So far we have considered one sided alternatives. If a two-sided alternative is consid¬ 
ered for each feature in each study, and the aim is to discover the features that have 
replicated effect in the same direction in both studies, the following simple modifica¬ 
tions are necessary. 

For feature j G {1,... ,m}, the left- and right- sided p-values for study i G {1, 2} are 
denoted by pb and pf*, respectively. For continuous test statistics, p^ = 1 — pb. 

For directional replicability analysis, the selection step has to be modified to include 
also the selection of the direction of testing. The set of features selected is the subset of 
features that are selected from both studies, for which the direction of the alternative 
with the smallest one-sided p-value is the same for both studies, i.e., 


S = Si D S 2 n ({j : max(pf i ,pj) < 0.5} U {j : max(py,p^-) < 0.5}) . 
In addition, define for j G <Si U S 2 , 


Pii 


P2i 


Pij if P2j < P‘L 
Pij if pij > P2P 


p| if pij < pjp 
p?j if Pij > Pij- 


The Bonferroni and FDR r-values are computed for features in S using the formulae 
given in Sections [3] and [ 5 ] (where dup and S 2: \ are the selected sets in Section [ 3 ]), 
with the following modifications: the set <Si fl S 2 is replaced by S , and p\ 0 and p 2 j are 
replaced by p[j and pfj for j G Si U S 2 . 
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As in Sections [3] and [4| features with r-values at most a are declared as replicated 
at level a, in the direction selected. The corresponding procedures remain valid, 
with the theoretical guarantees of directional FWER and FDR control for replica¬ 
bility analysis on the modified selected set above, despite the fact that the direction 
of the alternative for establishing replicability was not known in advance. This is 
remarkable, since it means that there is no additional penalty, beyond the penalty 
for selection used already in the above procedures, for the fact that the direction for 
establishing replicability is also decided upon selection. The proofs are similar to the 
proofs provided for one-sided hypotheses and are therefore omitted. 


6 Estimating the selection thresholds 


When the full data for both studies is available, we first need to select the promising 
features from each study based on the data in this study. If the selection is based 
on p-values, then our first step will include selecting the features with p-values below 
thresholds t\ and t 2 for studies one and two, respectively. The thresholds for selection, 
(ti, £ 2 ) £ (0, l] 2 , affect power: if (ti,t 2 ) are too low, features with Hj ^ H%ji ma Y n °t 
be considered for replicability even if they have a chance of being discovered upon 
selection, thus resulting in power loss; if (ti,t 2 ) are too high, too many features with 
Hj G H ° ni j will be considered for replicability making it difficult to discover the true 
replicated findings, thus resulting in power loss. 

We suggest automated methods for choosing (ii, t 2 ), based on (pi,p 2 ) and the level of 
FWER or FDR control desired, which are based on the following principle: choose the 
values (ti,i 2 ) so that the set of discovered features coincides with the set of selected 
features. We show in simulations in Section [7] that data-dependent thresholds may 
lead to more powerful procedures than procedures with a-priori fixed thresholds. 

Let iSj(L) = {j : pij < ti} be the index set of selected features from study i, for 
i G {1,2}. We suggest the selection thresholds that solve the two equations 


U = 




is 2 (f 2 )r 


h — 


Ol — Qq 


( 6 . 1 ) 


for Procedure 3.1 using Bonferroni, and the selection thresholds (£*,££) that solve the 
two equations 

cci a — CKi 


U = 


|S 2 ,A(t 2 )|^(t 2 ) 


t2 ~ 


|SiWi)l*o'(«i 


( 6 . 2 ) 


for the adaptive Procedure 4.1 for FWER control, where vtq^) and (ti) are the 


estimators defined in (4.1) with = <Si 5 a(1i) = {j ■ Pij < min(A,ti)} and S 2 = 
S 2 ,\ (£ 2 ) = {j '■ P 2 j < min(A,t 2 )}- We show in Appendix |C| that these choices are 
not dominated by any other choices (i.e., there do not exist other choices (ti,t 2 ) that 
result in larger rejection thresholds for the p- values in both studies). 
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Similarly, we suggest the selection thresholds (£*,£ 3 ) that solve the two equations 


£1 = 


|« 5 i(^i) n S2 (£2) |qu 

1^2 (^ 2 ) | 


to — 


|5i(ti) n5 2 (t 2 )|(a - «i) 
|5i(ti)| 


(6.3) 


for Procedure 3.2 for FDR control, and the selection thresholds (££,£ 3 ) that solve the 
two equations 


, _ |« 5 i,a(^i) n «s 2 ,a(£2)1^1 
1_ | 5 2 , a (£ 2 )|^(£ 2 ) ’ 

t _ |5 i iA (£i) n5 2 , A (£ 2 )|(a-ai) 

2 “ 

for the adaptive FDR-controlling procedure in Section [4} 


(6.4) 


If the solution does not exist, no replicability claims are made. There may be more 
than one solution to the equations (6.1) - (6.4). In our simulations and real data 
examples, we set as (£{,£ 3 ) the first solution outputted from the algorithm used for 
solving the system of non-linear equations. We show in simulations that using data- 
dependent thresholds (£{,£ 3 ) results in power close to the power using the optimal 
(yet unknown) fixed thresholds £1 = £ 2 , and that the nominal level of FWER/FDR is 
maintained under independence as well as under dependence as long as we use A = a 
for the adaptive procedures. We prove in Appendix [C] that the nominal level of the 
non-adaptive procedures is controlled even though the selection thresholds (£}, £ 3 ) are 
data-dependent, if the p-values are exchangeable under the null. 


7 Simulations 


We define the configuration vector / = (/oo, / 10 , foi, / 11 ), where f ik = J^'jU I[Hj = 
(l,k)\/m, the proportion of features with state (l,k), for (l,k) G {0,1}. Given /, 
measurements for mfi k features, were generated from lV(/p, 1) for study one, and 
N(nk, 1) for study two, where /x 0 = 0 and pi = /i > 0. One-sided values were 
computed for each feature in each study. We varied / and /j e {2, 2.5,..., 6 } across 
simulations. We also examined the effect of dependence within each study on the 
suggested procedures, by allowing for equi-correlated test statistics within each study, 
with correlation p = {0,0.25,0.5,0.75,0.95}. Specifically, the noise for feature j in 
study i G { 1 , 2 } was = ^/pZ l0 + y/l — pZij, where { Z t] : i = 1 , 2 , j = 1 ,..., m} are 
independent identically distributed N( 0,1) and Z i0 is N( 0,1) random variable that 
is independent of { Z t] : i = 1, 2, j = 1,..., m}. The p-value for feature j in study i 
was 1 — < h(/iq + Bij), where <F(-) is the CDF of the standard normal distribution, and 
Pij is the expectation for the signal of feature j in study i. 

Our goal in this simulation was three-fold: First, to show the advantage of the adap¬ 
tive procedures over the non-adaptive procedures for replicability analysis; Second, 
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to examine the behaviour of the adaptive procedures when the test statistics are 
dependent within studies; Third, to compare the novel procedures with alternatives 
suggested in the literature. The power, FWER, and FDR for the replicability analysis 
procedures considered were estimated based on 5000 simulated datasets. 


7.1 Results for FWER controlling procedures 


We considered the following novel procedures: Bonferroni-replicability with fixed 
or with data-dependent selection thresholds (ti, ^ 2 ), adaptive-Bonferroni-replicability 
with A G {0.05,0.5} and with fixed or with data-dependent (ti,t- 2 ). These proce¬ 
dures were compared to an oracle procedure with data-dependent thresholds (oraclc- 
Bonferroni-replicability), that knows X^jes 2 (t 2 )( 1 — -Hy) and SjeSi(ti)(l — ^ 2 j) an< ^ 
therefore rejects a feature j G iSi(ti) nd> 2 (f 2 ) if and only if p\ 0 < on/ Y2j e s 2 (t 2 )(^ — H\f) 
and p 2 j < (a — a 1 )/X^je 5 1 (t 1 )(l — # 2.7 )■ I * 1 addition, two procedures based on the 
maximum of the two studies p-values were considered: the procedure that declares 
as replicated all features with max(p 1 j,p 2 i) < a/m (Max), and the equivalent oracle 
that knows {j : Hj G 'H% R } | and therefore declares as replicated all features with 
m&x(pu,p 2 i) < a/\{j : Hj G H° NR }\ (oracle Max). Note that the oracle Max pro¬ 
cedure controls the FWER for replicability analysis at the nominal level a since the 
FWER is at most Yh{i-H i ^u Q N fl} Pr(max(p H ,p 2 i) < a/\{j : Hj G %^ R }|} < a. 

Figure [l] shows the power for various fixed selection thresholds t\ — t 2 — t. There is 
a clear gain from adaptivity since the power curves for the adaptive procedures are 
above those for the non-adaptive procedures, for the same fixed threshold t. The gain 
from adaptivity is larger as the difference between f n and /10 = foi is larger: while 
in the last two rows (where / 10 = foi < fn) the power advantage can be greater 
than 10%, in the first row (where f w = foi = 0.1,/n = 0.05) there is almost no 
power advantage. The choice of t matters, and the power of the procedures with 
data-dependent thresholds is close to the power of the procedures with the 

best possible fixed threshold t. 


Figure |2] shows the power and FWER versus p under independence (columns 1 and 
2) and under equi-correlation of the test statistics with p = 0.25 (columns 3 and 4). 
The novel procedures are clearly superior to the Max and Oracle Max procedures, 
the adaptive procedures are superior to the non-adaptive variants, and the power of 
the adaptive procedures with data-dependent thresholds is close to that of the oracle 
Bonferroni procedure. The adaptive procedures with A = 0.05 and A = 0.5 have 
similar power, but the FWER with A = 0.05 is controlled in all dependence settings 
while the FWER with A = 0.5 is above 0.1 in all but the last dependence setting. Our 
results concur with the results of Blanchard and Roquain (2009) for single studies, 
that the preferred parameter is A = 0.05. The adaptive procedure with A = 0.05 and 
data-dependent selection thresholds is clearly superior to the two adaptive procedures 
with fixed selection thresholds of t\ — t 2 — 0.025 or t\ — t 2 — 0.049. We thus 
recommend the adaptive-Bonferroni-replicability procedure with A = 0.05 and data- 
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dependent selection thresholds. 


7.2 Results for FDR controlling procedures 


We considered the following novel procedures for replicability analysis with a = 
0.05, aq = 0.025: Non-adaptive-FDR-replicability with fixed or data-dependent (H,£ 2 ); 
adaptive-FDR-replicability with A G {0.05,0.5} and fixed or data-dependent (ti,t 2 ). 


Heller and Yekutieli (2014) introduced the oracle Bayes procedure (oracleBayes), and 


showed that it has the largest rejection region while controlling the Bayes FDR. When 
m is large and the data is generated from the mixture model, the Bayes FDR coincides 
with the frequentist FDR, so oracle Bayes is optimal for FDR control. We considered 
this oracle procedure for comparison with our novel procedures. The difference in 
power between the oracle Bayes and the novel frequentist procedures shows how 
much worse our procedures, which make no mixture-model assumptions, are from 
the (best yet unknown in practice) oracle procedure, which assumes the mixture 
model and needs as input its parameters. In addition, the following three procedures 
were considered: the empirical Bayes procedure (eBayes), as implemented in the R 


package repfdr (Heller et ah, 2014b), which estimates the Bayes FDR and rejects 


the features with estimated Bayes FDR below a, see Heller and Yekutieli (2014) for 
details; the oracle BH on {max(pij,p 2i ) : i — 1,..., m} (oracleMax); and the adaptive 
BH on (max(pij,p 2 j) : i = 1 (adaptiveMax). Specifics about oracleMax 

and adaptiveMax follow. Applying the BH on {max(pij,p 2 i) : i = 1 at 

level x, it is easy to show that the FDR level for independent features is at most 
foox 2 + (1 — /oo — fn)x. Therefore, the oracleMax procedure uses level x, which is 
the solution to foox 2 + (1 — /oo — fn)x = 0.05, and the adaptiveMax procedure uses 
level x, which is the solution to foox 2 + (1 — /oo — fu)x = 0.05, where /oo and fn are 
the estimated mixture fractions computed using the R package repfdr. 


Figure [3] shows the power of novel procedures for various fixed selection thresholds 
ti — t 2 — t, as well as for the variants with data-dependent thresholds. There is a clear 
gain from adaptivity since the power curves for the adaptive procedures are above 
those for the non-adaptive procedures, for the same fixed threshold t. The choice of t 
matters, and the choice t = 0.025 is better than the choice t = 0.05, and fairly close to 
the best t. We see that the power of the non-adaptive procedures with data-dependent 
selection thresholds is superior to the power of non-adaptive procedures with fixed 
thresholds. The same is true for the adaptive procedures in all the settings except 
for the last two rows of the equi-correlation setting, where the power of the adaptive 
procedures with data-dependent thresholds is slightly lower than the highest power 
for fixed thresholds t\ — t 2 — t. In these settings the number of selected hypotheses 
is on average lower than in other settings, and the fractions of true null hypotheses 
in one study among the selected in the other study are expected to be small. As a 
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0.05, f10=f01= 0.025 



f 11 = 0.05,110=f01 = I 



Figure 1: Column 1: Independence setting; Columns 2: Equicorrelation with 

p = 0.25. The power versus hxed threshed t is shown for p = 3 for the adaptive- 
Bonferroni-replicability procedure (dashed black with A = 0.05 and dotted green 
with A = 0.5) and non-adaptive Bonferroni-replicability procedure (solid red), along 
with the power of these procedures with data-dependent thresholds. In all settings 
























































Figure 2l Columns 1 and 2: Independence setting; Columns 3 and 4: Equi-correlation with p = 0.25. Power 
and FWER versus p for the adaptive-Bonferroni-replicability procedure with data-dependent (£ 1 ,^ 2 ) with A = 0.5 
(solid black ) and with A = 0.05 (dashed black); Bonf^groni-replicability procedure with data-dependent (£ 1 ,^ 2 ) 
(dotted black); the oracle that knows which hypotheses are null in one study among the selected from the other 
study (dashed blue); oracle Max (dotted blue) and Max (dotted red); adaptive-Bonferroni-replicability with fixed 
A = 0.05 and fixed ti = t 2 = 0.049 (dash-dot green) and fixed £1 = £2 = 0.025 (dash green). In all settings m = 1000, 
a = 0.05, olx = 0.025. 






















































result, the solutions to the two non-linear equations solved using the estimates of the 
fractions of nulls are far from optimal. Therefore, when there is dependence within 
each study, and the number of selected hypotheses is small (say less than 100 per 
study), we suggest using the novel adaptive procedures with t\ — t 2 — a/2 instead of 
using data-dependent (ti,t 2 ). 


Figure [4] shows the power and FDR versus /i under independence (columns 1 and 
2) and under equi-correlation of the test statistics with p = 0.25 (columns 3 and 4). 
The novel procedures are clearly superior to the competitors: the empirical Bayes 
procedure does not control the FDR when m = 1000, and the actual level reaches 
above 0.1 under dependence; the oracleMax and adaptiveMax procedures have the 
lowest power in almost all settings. The novel adaptive procedures approach the power 
of the oracle Bayes as f w = foi increase. The adaptive procedures with A = 0.05 
and A = 0.5 have similar power, but the FDR with A = 0.05 is controlled in all 
dependence settings and the FDR with A = 0.5 is above the nominal level in three 
of the dependence settings. Our results concur with the results of Blanchard and 


Roquain (2009) for single studies, that the preferred parameter is A = 0.05. We thus 


recommend the adaptive FDR-replicability procedure with A = a, for FDR control 
at level a. We also recommend using data-dependent (ti, t 2 ), unless the test statistics 
are dependent within each study and the number of selected hypotheses from each 
study is expected to be small. 


8 Examples 

8.1 Laboratory mice studies comparing behaviour across strains 


It is well documented that in different laboratories, the comparison of behaviors of 
the same two strains may lead to opposite conclusions that are both statistically sig¬ 
nificant (Crabbe et al. (1999), Kafkafi et al.| (2005), and Chapter 4 in |Crusio et al. 


(2013)). An explanation may be the different laboratory environment (i.e. personnel, 
equipment, measurement techniques) affecting differently the study strains (i.e. an 
interaction of strain with laboratory). Richter et al. (2011) examined 29 behavioral 


measures from five commonly used behavioral tests (the barrier test, the vertical pole 
test, the elevated zero maze, the open field test, and the novel object test) on female 
mice from different strains in different laboratories with standardized conditions. Ta¬ 
ble 1 shows the one-sided p-value in the direction favored by the data based on the 
comparison of two strains in two laboratories, for each of the 29 outcomes. 


The example is too small for considering the empirical Bayes approach. The approach 
suggested in Benjamini et al. (2009) of using for each feature the maximum of the 
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Figure 3: Column 1: Independence setting; Columns 2: Equi-correlation with p = 
0.25. The power versus hxed threshold £(Js shown for p = 3 for the adaptive and 
non-adaptive FDR-replicability procedures, along with the power of these procedures 
with data-dependent thresholds. In all settings m = 1000, a = 0.05, a\ = 0.025. 








































Figure 4i Columns 1 and 2: Independence setting; Columns 3 and 4: Equi-correlation with p = 0.25. Power and 
FDR versus p for the adaptive-FDR-replicability procedure with data-dependent (ti,t 2 ) with A = 0.5 (solid black) 
and with A = 0.05 (dashed black); Non-adaptive-FDR-r2llh ca kility procedure with data-dependent (£ 1 ,^ 2 ) (dotted 
black); the oracle Bayes (dashed blue) and empirical Bayes (dashed red); the oracle and adaptive BH on maximum 
p- value, (dotted blue and dotted red); adaptive-FDR-replicability procedure with A = 0.05 and fixed t\ = t 2 = 0.049 
(dash-dot green) and fixed t\ = £2 = 0.025 (dash green). In all settings m = 1000, a = 0.05, a± = 0.025. 




















































Tctbl.6 li For 16 female mice from each of two inbred strains, ” C57BL6NCrl” and ”DBA/2NCrl”, in each of two 
laboratories, the Wilcoxon rank sum test one-sided p -value was computed for the test of no association between strain 
and behavioral endpoint. We show the p -values for the lab of H. Wurbel at the University of Giessen in column 3, 
and for the lab of P. Gass at the Central Institute of Mental Health, Mannheim in column 4. The direction of the 
alternative favored by the data is shown in column 2, and it is marked as ”X” if the laboratories differ in the direction 
of smallest one-sided p- value. The rows are the outcomes from 5 behavioural tests: the barrier test (row 1); the 
vertical pole test (row 2); the elevated zero maze (rows 3-11) ; the open field test (rows 12-19); the novel object test 
(rows 20-29). 



Alternative 

min (Pfj 
i = 1 

’A") 

i = 2 


Alternative 

min(P^ 

2=1 

-A?) 

i = 2 

1 

X 

0.3161 

0.0218 

16 

C57BL < DBA 

0.0059 

0.0002 

2 

C57BL < DBA 

0.0012 

0.0000 

17 

C57BL > DBA 

0.0176 

0.0003 

3 

X 

0.0194 

0.1120 

18 

X 

0.0000 

0.0538 

4 

Cb7BL < DBA 

0.0095 

0.2948 

19 

C57BL < DBA 

0.0000 

0.1727 

5 

C57BL < DBA 

0.1326 

0.0028 

20 

C57BL < DBA 

0.0157 

0.0001 

6 

C57BL > DBA 

0.1488 

0.0003 

21 

C57BL < DBA 

0.0000 

0.0234 

7 

Cb7BL > DBA 

0.2248 

0.0000 

22 

C57BL < DBA 

0.3620 

0.0176 

8 

X 

0.4519 

0.0005 

23 

C57BL < DBA 

0.0000 

0.0001 

9 

C57BL < DBA 

0.0061 

0.0000 

24 

C57BL < DBA 

0.0000 

0.0076 

10 

C57BL < DBA 

0.0071 

0.0888 

25 

C57BL < DBA 

0.0000 

0.0000 

11 

X 

0.4297 

0.1602 

26 

C57BL < DBA 

0.0000 

0.0003 

12 

C57BL < DBA 

0.0918 

0.0506 

27 

C57BL < DBA 

0.0000 

0.0001 

13 

X 

0.0918 

0.0001 

28 

C57BL < DBA 

0.0000 

0.0550 

14 

15 

Cb7BL < DBA 

X 

0.0000 

0.0005 

0.0048 

0.0550 

29 

X 

0.0033 

0.3760 


two studies p-values, i.e., 2 min{max(pf-, max(py,p^)}, detected overall fewer 
outcomes than using our novel procedures both for FWER and for FDR control. 

Table 2 shows the FWER/FDR non-adaptive and adaptive r-values, for the selected 
features, according to the rule which selects all features with two-sided p-values that 
are at most 0.05. We did not consider data-dependent thresholds since the number of 
features examined was only 29, which could result in highly variable data-dependent 
thresholds and a power loss comparing to procedures with fixed thresholds, as was 
observed in simulations. At the a = 0.05 level, for FWER control, four discoveries 
were made by using Bonferroni on the maximum p-values, and five discoveries were 
made with the non-adaptive and adaptive Bonferroni-replicability procedures. At 
the a = 0.05 level, for FDR control, nine discoveries were made by using BH on 
the maximum p-values, and nine and twelve discoveries were made with the non- 
adaptive FDR and adaptive FDR-replicability procedures, respectively. Note that 
the adaptive r-values can be less than half the non-adaptive r-values, since 7Tq = 0.44 
and fry 7 = 0.47. 


8.2 Microarray studies comparing groups with different can¬ 
cer severity 


Freije et al. (2004) and Phillips et al. (2004) compared independently the expression 
levels in patients with grade III and grade IV brain cancer. Both studies used 
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Tctbl.6 2: The replicability analysis results for the data in Table 1, after selection of features with two-sided p -values 
at most 0.05 (i.e. t\ = £2 = 0.025). Only the twelve features in <Si fl S 2 are shown, where Si = 20, S 2 = 19. For 
each selected feature, we show the r -values based on Bonferroni (column 5), FDR (column 6), adaptive Bonferroni 
(column 7), and the adaptive FDR (column 8). The adaptive procedures used A = 0.05. 


index 

selected 

Alternative 

“■ B 

I' 1 

’R7) 

i = 2 

Non-adaptive 
Bonf FDR 

Adaptive 

Bonf FDR 

2 

C57BL < DBA 

0.0012 

0.0000 

0.0452 

0.0090 

0.0200 

0.0040 

9 

C57BL < DBA 

0.0061 

0.0000 

0.2323 

0.0290 

0.1029 

0.0129 

14 

C57BL < DBA 

0.0000 

0.0048 

0.1910 

0.0290 

0.0905 

0.0129 

16 

C57BL < DBA 

0.0059 

0.0002 

0.2237 

0.0290 

0.0992 

0.0129 

17 

C57BL > DBA 

0.0176 

0.0003 

0.6679 

0.0607 

0.2960 

0.0269 

20 

C57BL < DBA 

0.0157 

0.0001 

0.5974 

0.0597 

0.2648 

0.0265 

21 

C57BL < DBA 

0.0000 

0.0234 

0.9363 

0.0780 

0.4435 

0.0370 

23 

C57BL < DBA 

0.0000 

0.0001 

0.0022 

0.0011 

0.0010 

0.0005 

24 

C57BL < DBA 

0.0000 

0.0076 

0.3037 

0.0337 

0.1439 

0.0160 

25 

C57BL < DBA 

0.0000 

0.0000 

0.0005 

0.0005 

0.0003 

0.0003 

26 

C57BL < DBA 

0.0000 

0.0003 

0.0126 

0.0032 

0.0060 

0.0015 

27 

C57BL < DBA 

0.0000 

0.0001 

0.0038 

0.0013 

0.0018 

0.0006 


the Affymetrix HG U133 oligonucleotide arrays, with 22283 probes in each study. 


The study of Freije et al. (2004) (GEO accession GSE4412) included 26 subjects 


with tumors diagnosed as grade III glioma and 59 subjects with tumor diagnosis of 
grade IV glioma, all undergoing surgical treatment at the university of California, Los 


Angeles. The study of Phillips et al. (2004) (GEO accession GSE4271) included 24 


grade III subjects, and 76 grade IV subjects, from the M.D. Anderson Cancer Center 
(MDA). The Wilcoxon rank sum test p -values were computed for each probe in each 
study in order to quantify the evidence against no association of probe measurement 
with tumor subgroup. 


We used the R package repfdr (Heller et al. 2014b) to get the following estimated 
fractions, among the 22283 probes: 0.39 with h = (0,0); 0.16 with h = (1,1); 0.13 
with h = (—1, —1); 0.10 with h = (0,1); 0.08 with h = (—1, 0); 0.07 with h = (0, —1); 
0.07 with h = (1,0); 0.00 with h = (-1,1) or h = (1, -1). 


For FWER-replicability, the recommended Procedure 4.1 with A = 0.05 and data- 
dependent thresholds ti = 6.5 * 10 —5 , ^2 = 5.1 * 10~ 5 discovered 340 probes. For 
comparison, the non-adaptive and adaptive Bonferroni-replicability procedure with 
fixed thresholds t\ = t 2 = 0.025 discovered only 90 and 124 probes, respectively. The 
Bonferroni on maximum p-values discovered only 47 probes. 


For FDR-replicability, the recommended adaptive procedure in Section [4] with A = 
0.05 and data-dependent thresholds t\ = 0.021, t 2 = 0.024 discovered 3383 probes. 
For comparison, the non-adaptive and adaptive FDR-replicability procedure with 
fixed selection thresholds t\ = t 2 = 0.025 discovered 2288 and 3299 probes, respec¬ 
tively. The adaptive r-values can be half the non-adaptive r-values, since = 0.51 
and dp 1 = 0.49. Among the two competing approaches, the BH on maximum p- 
values discovered only 1238 probes, and the empirical Bayes procedure discovered 
4320 probes. Among the 3383 probes discovered by our approach, 3377 were also 
discovered by the empirical Bayes procedure. 
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9 Discussion 


In this paper we proposed novel procedures for establishing replicability in two stud¬ 
ies. First, we introduced procedures that take the selected set of features in each 
of two studies, and infer about the replicability of features selected in both studies 
while controlling for false replicability claims. We proved that the FWER controlling 
procedure is valid (i.e., controls the error rate at the desired nominal level) for any 
dependence within each study, and that the FDR controlling procedure is valid un¬ 
der independence of the test statistics within each study, and suggested also a more 
conservative procedure that is valid for arbitrary dependence. Next, we suggested 
incorporating the plug-in estimates of the fraction of nulls in one study among the 
selected features by the other study, which can be estimated as long as the p-values 
for the union of features selected is available. We proved that the resulting adap¬ 
tive FWER and FDR controlling procedures are valid under independence of the test 
statistics within each study. Our empirical investigations showed that the adaptive 
procedures remain valid even when the independence assumption is violated, as long 


as we use A = a as a parameter for the plug-in estimates, as suggested by Blanchard 


and Roquain (2009) for the adaptive BH procedure. Finally, when two full studies are 


available that examine the same features, we suggested selecting features for replica¬ 
bility analysis that have p-values below certain thresholds. We showed that selecting 
the features with one-sided p -values below a/2 has good power, but that the power 
can further be improved if we use data-dependent thresholds, which receive the values 
that will lead to the procedure selecting exactly the features that are discovered as 
having replicated findings. 


Our practical guidelines for establishing replicability are to use the adaptive proce¬ 
dure for the desired error rate control, with A = a. Moreover, based on the simulation 
results we suggest using the data-dependent selection thresholds when two full studies 
are available if the number of selected features in each study is expected to be large 
enough (say above 100), and using the fixed thresholds ti = t 2 = a/2 otherwise. We 
would like to note that the r -value computation is more involved when the thresholds 
are data-dependent, since these thresholds depend on the nominal level a. An inter¬ 
esting open question is how to account for multiple solutions of the two non-linear 
equations that are solved in order to find the data-dependent thresholds. 


The suggested procedures can be generalized to the case that more than two studies 
are available. It is possible to either aggregate multiple results of pairwise replicability 
analyses, or to first aggregate the data and then apply a single replicability analysis on 
two meta-analysis p- values. The aim of the replicability analysis may also be redefined 
to be that of discovering features that have replicated findings in at least u studies, 
where u can range from two to the total number of studies. Other extensions include 
weighting the features differently, as suggested by Genovese et al. (2006), based on 
prior knowledge on the features, and replicability analysis on multiple families of 
hypotheses while controlling more general error rates, as suggested byjBenjamini and 


Bogomolov (2013). 
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A Notation for technical derivations 


For the technical derivations, the following notation will be used. Let p % be the Tri¬ 
dimensional vector of p- values for study i, Sifpi) be the index set of features selected 
from study i based on the vector of p- values pi, and Sjfpt) be the cardinality of this 
set, for i e {1, 2}. Let P- j> = (Pn ,. . ., Pij- 1 , Pij+i, • • • , P vm ) be the vector of p- values 
for the m — 1 features excluding j, for i — 1,2. When the selection rule by which 
the set S{ is selected is stable, define C { 1 ,..., j — 1 , j + 1 ,..., m} as the set of 
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indices selected along with j. if j G S i: and S-^ as sj j) n {7 ^ j : P u < A} if 
for i G {1,2}, and let = |5,f /l |- Define S^\tj) = {l : pu < t i: l j} as the 
index set of features with p -value at most t, t from the vector of p- values p^\ and let 
S^\ti) = |5_P(tj)|. For c G (0,1), we write = ca and a 2 = ot — a.\. 


B Proof of Theorems 3.2 and 4.2 


In the proofs of Theorems T2 and T2 we use the following lemma. The lemma is 
proven in the end of the section. 

Lemma B.l. Let Si be the selected set of features based on study i, for i = 1, 2. Let 
rj, j G fl S 2 be the Bonferroni-type r-values: 


rj = max 


f WiPij W 2 p 2 j ) 

l c ’1-cJ 


, j £ n s 2 , 


(B.l) 


where c G (0,1) is a constant and W \, W 2 may be constants or random variables based 
on p-values. The FDR r-values based on the Bonferroni-type r-values are: 


r FDR 


mm 


n 


{%: ri>rj,i£SinS 2 } rank(ri ) ’ 


j G iSi fl S 2 , 


where rankirf) is the rank of the Bonferroni-type r-value for feature i G <Si with 
maximum rank for ties. 


(1) The procedure that declares as replicated the features with FDR r-values at most 
a is equivalent to the following procedure on the selected features Si fl S 2 : 


(a) Let 


R = max < r : I 

j£SinS 2 


. ran ra 2 
(Plj,P2j) _ [ ^ 


= r 


(b) The set of indices with replicability claims is 

, . f R(y.\ Ret o \ , 

K = {j : (PijiP2j) < , J 6 Si D S 2 }. 


(2) The procedure that declares as replicated the features with FDR r-values at most 
a controls the FDR for replicability analysis at level a if the following conditions 
are satisfied: 

(a) The p-values corresponding to true null hypotheses are each independent 
of all the other p-values. 
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(b) For each j G {1,..., m}, there exist random variables (or constants) W[ 3 \ W 2 


defined on the space {P[ 3 \ P 2 J> ) such that if j G <Si fl S 2 , then W\ = 
W 2 = W 2 \and for arbitrary fixed vectors piand p 2 it holds: 


U) 


I\j g S 2 (p 2 )}E(l/w[ j) | P 2 = P2 ) < 


^—'j&S 2 (jP2) ^lj) 


(B.2) 


I[j eS 1 (p 1 )}E(l/w¥ ) \P 1 =p 1 )< 


h 


(B.3) 


Proof of ite m 1 of Theorem 13.21 T he res ult of item 1 of Theorem 13.21 follows from 
Lemma 


B.l 


The conditions of Lemma 


B.l 


hold with W 1 =S 2 , W[ j) = 1 + S%\ and 


W 2 — Si, W 2 ^ = 1 + S[' ] . In order to see it, note that for j G <Si fl S 2 , S) = 1 + Sj 3) 
for i G {1, 2}. In addition, note that for arbitrary fixed vector p 2 and j G {1,... ,m} 

1 , „ \ _ r „ , „„ / 1 


I \j G S 2 (p 2 )]E 


W( 


U) 


P 2 = P2\ =1 [j G S 2 (p 2 )]E 


=1 \j G S 2 {p 2 )\ 


1 + S 2 

1 


M = P 2 


< 


Esfe) J E J e5 2 (p2)( 1 ~ 

Thus we have proved inequality (B.2). The proof of inequality ( |B.3 ) is similar. 

Proof of Theorem 14.21 The result of Theorem 14.21 follows from Lemma IB. 11 The 
conditions of Lemma B.l|hold with S^x, i — 1, 2 as the selected sets, and 


1 + E*=«j I(Pii>A) 1 + E,ps IfPii > A) 

Wi = S 2 , 1 --, W? = Z "* 6 *' M * 1 ; 


(1-A) 


W 2 = s ltX K = 


1 + E^S 1(^2* > A) 

II _ V > lxr (j) 


-, 1TE = 


(1-A) 

1 + > A ) 


(1-A) ’ 2 (1-A) 

In order to see it, note that if j G <Si,a fl S 2t \, it holds that max {Pi j, P 2] } < A. In 


addition, it was shown in the proof of Theorem 4T that for arbitrary fixed vector p 2 
and j G {1,..., m} 


I \j G S 2 , x (p 2 )}E 




ti) 


I \j G S 2 ,x(p 2 )]E ( 1/ 

1 


P 2 = P 2 J < 

1 + Eie5 2l A,i ? tj(^ — Hij)I(Pu > X) 

1-A 


Pi = P2 I < 


E 


j&S 2 ,\{p2) 


(1 ~H 


ij) 


The proof of inequality (B.3) is similar 


Proof of item 1 of Lemma |B.1[ Note that the procedure given in item 1 of Lemma 
IB.II can be written as follows: 







































1. Let 


R 


max < r 


E 

jeS 1 nS 2 


I 


Vj < ra\ = r 


2. The set of indices with replicability claims is 


R = {j : rj < Ra, j G Si fl S 2 }. 


Let r 0 = max jr : EjeSinS 2 1 Vj — ra \ — r } ■ We prove that R = r 0 by contradic¬ 
tion. From the definitions of ro and R it follows that if R ^ ro, then tq > R, 
and E je5l n 5 2 1 Vj < r o«] > T-o + l- However, since s 2 1 Vj ^ ( r o + !)«] > 

E jes 1 ns 2 1 Vj ^ r o«] d follows that r 0 + 1 is also in jr : E jG 5,n5 2 1 E < H > r}, 
thus contradicting the definition of ro as being the greatest value in this set. Thus 
we have proved that 

R = max < r : I [rj < ra] > r > . (B.4) 

l ie-Sin5 2 J 


We now prove that the procedure that declares as replicated the features with FDR 
r-values at most a is equivalent to the following procedure in item 1, i.e. 


{j : r, K < a} = {j : r, < Ra}, 


(B.5) 


where R is given in (B.4). Let us first prove that 

{j : r RDR < a} C {j : r, < .Ra}. 


(B.6) 


Let j G {j : r RDR < a} be arbitrary fixed. There exists io G fl such that r io > rj 


and 


»0 


mm 


< a. 


rank(ri 0 ) {i:r- i >rj,ie5in5 2 } rank(ri ) 

Thus r io < rank(r io )a. Therefore, rank(r io ) < EjeSins 2 1 Vj — rank(r io )a]. This 
inequality and the expression for R given in (B.4) yield that rank(ri 0 ) < R. It follows 
that r io < Ra. Recall that r 3 < r l0 , therefore rj < Ra. Thus we have proved (B.6). 
Let us now prove that 


{j : rj < Ra} C {j : r RDR < a}. 


(B.7) 


Let j G d>i fl S 2 be an arbitrary fixed index such that rj < Ra. Since rj < r^, and 
-^r- < a (where r^ is the R’th largest r-value), it follows that 


FDR 


mm 


{v. n>rj,ieSi ns 2 } rank(ri 


< a. 


Thus we have proved (B.7), w) 


heh completes the proof of (B.5) and of item 1. 


Proof of item 2 of Lemma 


B.l 


For j G {!,..., m} let us define Cjf^ as the event 
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in which if r RDR < a, then the total number of FDR r-values which are at most a 
is k. It follows from item 1 and from condition (ii) of item 2 that the event C!^ } is 
defined on the space (P^\P^) as follows. Let 



max 




w¥ ) P 2i\ 
r-c J 


oo 


if i e n S?, 

otherwise. 


and let t\ :,) < ... < be the sorted T-values, where we set = 0. Note that 
T t (j) = rj for i G fl S 2 J) . It follows from the equivalent procedure given in item 1 
of Lemma IB. II that 

cf = {(dh Pip ■ dL) 2 ka . T m >(* + !)«.■■■. r,2_i) > ma}. (B.8) 


Note that given P\ = pi, for j G du(pi), Cjf^ = 0 for k > Si(pi), since the number of 
finite is smaller or equal to Si(pi) — 1. Similarly, given P 2 = p 2 , for j G S 2 (p- 2 ), 

C^ ] = 0 for k > S 2 {p 2 ). In addition, note that C^’ 1 and C$ are disjoint events for 
any k ^ k' and \Pi = Pi) = Y)k= Pr(C^|P 2 = p 2 ) = 1. 

The FDR for replicability analysis is 

m m .. 

FDR = £(1 - HuHy) V _p r (j e 5, n rf DR < a, Cf ) 

j =1 fc=l 

m m 

= E* 1 - H v H v) E i Pr 0 eS,n S2t r> - ka ■ C L) < B - 9 ) 

3=1 k =1 

m m 1 

< El 1 - tf«) E F Pr (■? e 5 1 n ^ ka • C f ) (B !0) 

J=1 fc=l 

m m 1 

+ E(! - Hy) E P Pr (j 6 n r J < fc “. cf) (B.ll) 

J=1 fe=l 


where the equality in (B.9) follows from item 1, and the inequality in (B.10) follows 
from the fact that 1 — HijH 2 j < 2 —H]j — H 2] for all j G {1,..., m}. We prove that for 
(pi , p 2 ) arbitrary fixed, the following inequalities hold for conditional expectations. 


- Hl i) T Pr 0 e n 5 2 , Tj < ka, & k j) | P 2 = p 2 ) < ay, 


3 = 1 
m 


k =1 
m 


P‘2 = P2j < «1, 

(B.12) 

P\ = Pi) < a2. 

(B.13) 


2_P~hv)2^\p 

j =1 k= 1 

Note that since these inequalities hold for all pi and p 2 , they yield that the upper 
bounds in (B.12) and (B.13) hold for expressions in (B.10) and (B.ll) respectively, 
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therefore FDR for replicability analysis is upper bounded by aq + aq = ot. Thus it 


remains to prove inequalities (B.12) and (B.13). We now prove inequality (B.12) 


j= 1 


k =1 


T Pr U G n ^2, r 3 < ka, C { k j) \ P 2 = p 2 


S 2 D 2 ) 


V (1 - Hii) £ r p r (j 6 Si, r s < to, C“ | P 2 = p 2 ) < 


j&S 2 (p 2 ) 


k= 1 

S2{P2) 


S 2 KP 2 ) 1 / , 

S (1 “ ^ S fc Pr U G Pl 2 ^ 

j&S 2 (p 2 ) k =1 \ W 


S 2 (p 2 ) 


a 1 




jes 2 (p 2 ) 


k= 1 V^I 
J <S’ 2 (P 2 ) 

a, £ (1 - I ^5T E 1 




0) 


j&s 2 (p 2 ) 


Wf’ 


G 


0) 



(B.14) 

(B.15) 



(B.16) 

(B.17) 


— Ctq. 


The inequality in (B.14) follows from condition (m) of item 2. The inequality in 


(B.15) follows from the fact that the distribution of Py is uniform or stochastically 
larger than unifor m and Py with Hy = 0 is independent of all other p-values. The 
equality in (B.16) follows from the fact that given P 2 = p- 2 , is the whole 

sample space represented as a union of di sjoint events (as discussed above), therefore 

\^S 2 (p 2 ) j 

/Efc=1 L 


a 


U) 


= 1. The inequality in 


ity (B.2). Thus we proved inec 


B.17) follows from condition ( n ) of item 


uality (B.12). Inequality (B.13) is proved 


2, inequa 
similarly. 

Proof of item 2 in Theorem 3.2[ The proof is similar to the proof of item 3 of 
Theorem S3.2 in the Supplementary Material of Bogomolov and Heller (2013). We 
give it below for completeness. For j, k G {1,..., m}, we define as the event 
in which if fJ DR < a, then the total number of arbitrary-dependence FDR r-values 
which are at most a is k. Similarly to the proof of item 2 of Lemma |B. II it can be 


shown that the event is defined on the space of {P\ 3> ■, P-f 1 ) as hi (B.8), where 
T-values are replaced by T- values which are defined as follows. 


U) r>tih 


j^U) _, 


max 


(j) (j) 

(EfE +1 vp(g2 J) +i)PH (eLI 1 , +1) m(s[ j) +i) P2i 


if i g s[ j) n 

otherwise. 


:C?) 


1 — C 


OO 
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Note that = f\ for i G fl S^\ where the expression for rq is given in (3.2). 


Similarly to the proof of item 2 of Lemma B.l, it can be shown that given P\ = pi, 
= 0 for j G <Si(pi) and k > S±(pi), and is the whole sample space. 


Given P 2 = p 2j C^ = 0 for j G S 2 (p 2 ) and k > S 2 (p 2 ), and U l^ 2) C [ k J> is the whole 
sample space. In addition, and C^) are disjoint events for any k ^ k'. 

We ob tain t he foll owing inequality for the FDR for replicability analysis using deriva¬ 
tions (B.9 )- (|b. 11) where we replace r RDR , rj and by r RDR , kj and C^ \ respec¬ 
tively. 


S2(p2) r <(j) 


FDR < Ed - E -Pr (j G Si fl S 2 , rj < ka , C l 


j =i 


k =1 
m 


-iU) 

k 


k= 1 


y^(f - H 2j ) Y / : Pr [J e Si n s 2 , kj < ka , C' 

i=i 


(B-18) 


We now hnd an upper bound for the hrst term of the sum in (B.18). Let p 2 be 
arbitrary fixed. We define aq = Y 1/i). We shall prove that 


III / IL ^ 

y^(l - #ij) ^ -Pr (j G Si H S 2 , fj < ka , C'^ } | P 2 = P2 ) < aq. 

j =1 fc=i 


(B.19) 


Note that 


j=i 

S 2 {p 2 ) 


E( 1 - ffu) E i Pr ( 1 6 s i n s **>. < 5 ® 1 a 


2 = P2 = 


E d-// 


1 j 


IX 


k =1 


j&S 2 (V2) 


k =1 


Si 


E i Pr ( S 1 ( E V* 1 J e -Sl. PlJ < <5« I A = P2 


, Z— 1 




S 2 (P2) 1 y , _ 

< E d-»«) E ^ Pr ( p «£^-d“i-P2=P2 


fe=i 


5*2 (P2 


jeS 2 (p 2 ) 

For each j with H\j — 0, k G {1, ..., and l G {1,..., h}. let us dehne: 


Pjki = Pr P\j e 


(7 — l)aq /aq 


S'sfe) ’ <S , 2(P2) 


.<5“l p 2 = P2|. 


Note that for j with Hij = 0, Pr (P^- < x) < x for all x G [0,1], in particular 
Pr (Pij = 0) = 0. Therefore, for each j with H\ 3 = 0 and k G {1,..., S 2 (p 2 )}, 




Pr ' Plj ~ 1 Pa = P2 J = ^ m ■ 
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Using this equality we obtain: 
_ S 2 (p 2 ) ^ 

e a- h v) y 


j£S 2 (p 2 ) 


k =1 

S 2 (p2) ^ k 


lj — 


fccii 
5*2 (p' 2 ’ 


,C«|f. 2 =p 2 ) = 


52(P 2 ) S 2 (p2) 


'y'j (-*- Hij) ^ Pjki — h^ jki — 


j£S 2 (p 2 ) 


k 

k =1 Z=1 

S 2 (p 2 ) s 2 (p 2 ) . 


jS5 2 (p2) 


Z=1 fc=2 
52(P2) ^ S 2 (p 2 ) 


i 1 - H y) Y1 Y2 j p i ki - J2 t 1 - Hi ^ Y1 7 5Z 


(B. 20 ) 


irJM - / ^ V / V J 

j£S2(p2) J=1 k=l jG«S2(p2) J=1 fe=l 

Since is a union of disjoint events, we obtain for each j with H\j = 0 and 

/ G {1,..., <S 2 (p 2 )}: 

52 (^ 2 ) 


^ Pifcz = Pr Pij G 


fc=i 

< Pr ( P\j G 


(/ — l)bi Zai 


<S , 2(P2) ’ S 2 (p 2 ) 

(/ — l)o:i la 1 


| ,52(P2)^-yO') I n _ 

1 U fc=l L 'k \^2-P2 


= Pl ' P « £ Sfe 


<5 2 (p 2 ) ’ *5*2(.P2)_ 
/bi 


P2 =P 2 


|P2 = £>2 




A = P2 ■ 


Therefore for each j with H\j = 0 we obtain: 

S 2 {p 2 ) 


S 2 {p 2 ) S 2 (p 2 ) 




l 

1=1 k =1 
52 (P 2 ) 


Z=1 


E i Pr p « < 


1. 

T 

1=1 

52 (P2 ) 1 

= E 

z=i 

52 (P2 ) 1 

s E 


/Oi 

SM 


Pr I P tj < 


P2 = P2 


la 1 


<5 2 (p 2 ) 

S 2 (p 2 ) — 1 

- E 

1 =0 


P2 = P2 - Pr Pu < 


(l — l)oi 
*5*2 (^2 ) 


^2 = P2 


Z + l 


Pr ( Pij < 


la\ 

S 2 (p 2 ) 


P2 = P 2 


l 1 +1 


Pr 1 y s ik 


Oi 


+ 


Oi 


—' 1 + 1\S 2 {P2)J S 2 (P2) \S 2 (p 2 ) J 1 S 2 (p 2 ) 


P*2 = P2 + 


Oi 


<52 (P2 


-Pr (Pij <a 1 \P 2 = p 2 ) 


S 2 (p 2 ) 

E 


Oi 


(B.21) 


The inequality in (B.21) follows from the null independence-across-studies condition 
and the fact that for j with H\j = 0, Pr(Pij < x) < x for all x G [0,1]. Combining 
(IB.201) with (] B.21 h we obtain the inequality in (IB. 191): 


52 (P 2 ) 

Ed- ff y) E x: Pr ( p y < 


ie5 2 (p2) 


fc=i 


/coi 

S 2 (p 2 


, C*f ’ | P 2 = P2 ) < «1 


^je5 2 (p2)^ -^li) 

<5 2 (p 2 ) 


< Oi- 


We have proved that the inequality in (B.19) holds for p 2 arbitrary fixed, therefore 
the first term of the sum in (B.18) is upper bounded by Oi. Similarly it can be proven 
that the second term of the sum in (B.18) is upper bounded by o 2 . ft follows from 
(B.18) that these two inequalities yield FDR < ctq + a 2 = a. 
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C Theoretical properties for Section [6] 


We use the following lemma to justify the empirical selection of (ti,t 2 ) for Procedure 
13.II based on Bonferroni. 

Lemma C.l. Assume that Gi(ti),i G {1,2} are monotone increasing functions. Let 
(tftf) be the solution to the following two equations: 

OL\ OL 2 

h = c 2 (t 2 y h = Gh(fi) 

Then there does not exist a pair (ti,t 2 ) that dominates (tytf) the following sense: 

min (^cfe)\ > / min (^4)\ 

min (chW)’ t2 T ^(gRr)’^) 

where the strict inequality means that both coordinates are at least as large with (ti, t 2 ) 
as with {t*i,t* 2 ), but at least one coordinate is strictly larger. 


(C.l) 


See Appendix |C.l| for a proof. Clearly, S\(ti) and are increasing functions 

of £ 1 , and similarly S 2 (t 2 ) and \<S 2 ,\(t 2 )\iTQ are increasing functions of t 2 . From Lemma 
C.l|it follows that the choice (£*, t 2 ) in equations (6.1) or (6.2) is not dominated by any 


other choice of (fi,f 2 ) hr Procedure 3.2 and Procedure 4.1 respectively. Therefore, 
we suggest these data-dependent ( t\,t 2 )■ 


Our next theorems state that the FWER and FDR of the non-adaptive procedures 
using the above data-dependent thresholds for selection are controlled under inde¬ 
pendence. 


Theorem C.l. If the p-values from true null hypotheses within each study are ex¬ 
changeable, and each independent of all other p-values, then Procedure 3A_ based on 
Bonferroni with selection thresholds (t\,t 2 ) = (t\,t 2 ) which are a single solution to 
equations (6.1) controls the FWER for replicability analysis at level a. 


The proof of Theorem C.l is given in Appendix C.2 


Theorem C.2. If the p-values from true null hypotheses within each study are ex¬ 
changeable, and each independent of all other p-values, Procedure \3.I\ with selection 
thresholds (ti,t 2 ) = (tl,t 2 ) which are a single solution to equations (6.3) controls the 
FDR for replicability analysis at level a. 


The proof of Theorem C.2 is given in Appendix C.3 
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C.l 


Proof of Lemma 


C.l 


The proof is by contradiction. Suppose that there exists a pair (£°, t? 2 ) that dominates 
(£*,£ 2 ), br the sense that 


oig) \ > 

lllhl 

Then either the first coordinate or the second coordinate satisfy a strict inequality. 
Without loss of generality, assume that the first coordinate satisfies a strict inequality, 

i.e. 


t* 

L i 

£* 

L 2 


(y,\ 

G 2 (£o) 


< min 



Ot 2 

gM 


< min 



(C.2) 

(C.3) 


It follows from (C.2) that G ^ t *^ < c^t ") 1 th ere f° re using the fact that G 2 (t 2 ) is a 
monotone increasing function we obtain that t 2 < t 2 . It follows from (|C.3) that 


£2 > £ 2 - A contradiction is thus reached. 


C.2 


Proof of Theorem 


C.l 


Procedure 3.1 based on Bonferroni makes replicability claims for features with in¬ 
dices in the set {j : P\ 3 < u 1 , P 2 j < w 2 }, where U\ = min(£ 1 , a 1 /S' 2 (£ 2 )), u 2 = 
min(£ 2 , ot 2 /Si(t\)). Obviously the choice of selection thresholds £*, £ 2 solving the equa¬ 
tions (6.1) leads to the rejection thresholds {u\,u* 2 ) = (£j,£ 2 ), i.e. satisfying 

u i — Oii/S 2 (u 2 ),u 2 = a 2 /S 1 (u\). (C.4) 


Thus the FWER of Procedure 3.1 using (£*,£ 2 ) is bounded above by 


^(1 - //; /) I *r( I’\ j < u* 2 , P 2j < u 2 ) + ^(1 - H 2j )Pr(Pij < u\,P 2j < 
3 =1 3 = 1 


- u 2 


We shall only show that the upper bound of the first sum is at most a 1 , since the 
proof that the upper bound of the second sum is at most a 2 follows similarly. 

For each j e {1,..., m}, define (u*^\ u* 2 ^) to be the solution of the equations 

S 2 {u 2 )ui = Qii, [S[ j \ Ul ) + l]n 2 = a 2 . 

Note that {u-^\u* 2 ^) are independent of P 2 j and that if P 13 < u\ then ( u{,u 2 ) = 

/ *0) *(i)\ 

(“l ,«2 )■ 
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Consider now j with Hj G {(0, 0), (0,1)}: 

Pr(-Pij < ul,P 2j < u* 2 \p[ 3 \p 2 


= Pr(P lj < u\ {j \P 2j < «,<) = {u"\uV>)\P?\P 2 ) 

< Pr(P li < ai 


*C?) 


,*U) „,*(j)\|d(j) 


S 2 (ul ,i] ) 


-,P 2 i <nf ) \P?,P 2 ) 


a i 


s 2 (u; u >) 


nrllPii < 


u. 


-Uh 


(C.5) 


The equality (C.5) follows from the fact that H\j = 0, so P\j has a uniform distribution 
(or is stochastically larger than uniform). Let j 0 G {j : Hj G {(0,0), (0,1)}} be an 
arbitrary fixed index . It thus follows that 


£(1 - ffij)Pr(Pij < Py < «5) = J](l - H„)B{Pr(P u < < ^IP'Gp)} 

j = 1 i=i 




Cti 


J=1 


E(!-^ij)P 

i=i 


&(<4 W ) 

a.\ 

sib? 5 ) 


![pj < u 


*C?)i 


I[Pj < 


W. 


< a.\E 


f E™,(1 - H u )l [Py < 


u. 


«Oo)l 




■0'o)l 


< «!• 


(C.6) 

(C.7) 


The equality (C.6) follows from the fact that the distribution of yP±°\ P 2 J is the 
same as that of f P [ J) , P 2 j for every for every j with Hj G {(0,0), (0,1)}, since the 


p-values are assumed to be independent and exchangeable under the null. 


C.3 Proof of Theorem C.2 


Procedure 3.2 makes the replicability claims for features with indices in the set {j : 


P\j < Ui,P 2 j < u 2 }, where U\ = min(t 1 , Rai/S 2 (t 2 )), u 2 = min(i 2 , Ra 2 /Si(ti), and 


R = max { r : I 

jG5i(ti)n5 2 (t2) 


(Plj,p2j) < 


ra i ra 2 


s 2 (t 2 y 5i(ti) 


= r 


Note that R < |«5i(ti) fl S 2 (t 2 )\. In addition, when the choice of selection thresholds 
is (ti,t 2 ) = (t{,t 2 ), which satisfy 


+* — 


|5 1 (tI)D5 2 (^)|ai |5i(^)D5 2 (^)|a 2 

5 ^2 — 


s 2 (t* 




(C.8) 
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it holds that 


E 


I 


jeSi(t* 1 )nS 2 (t*) 


(Plj,P2j) < 


|Si(fJ) n5 2 (^)| ai |5x(^) n5 2 (r 2 )|a 2 


s 2 (n 


Sift) 


= \s 1 te)ns 2 te)\. 


Therefore, the choice of selection thresholds £*,£ 2 solving the equations in (C.8) leads 
to R — |5i(f^) fl <S 2 (£ 2 )|, and to the rejection thresholds (n*,u 2 ) = (f^,f 2 )- Thus the 


FDR of Procedure 3.2 using (£|,£ 2 ) is bounded above by 


FDR = E 


E 


EJll l ( p i] < 7i l\ p 2] < U* 2 ) 
E7 = i( 1 - H 2j )I{Py < ul P 2j < Up 
Er=i MPj < y*u F23 < u* 2 ) 


(C.9) 


where u\ and w 2 satisfy 


* |Si(ui)nS 2 (wS)|ai * 
" 1 = - S&Q - ’" 2 


|5iK)n5 2 (^)|a 2 

5iK) 


We shall only show that the first term of the sum in (C.9) is upper bounded by ot\. 


The second term of the sum in (C.9) is upper bounded by a 2 , which yields that the 


FDR is upper bounded by a. The proof that the the second term of the sum in (C.9) 
is at most a 2 follows similarly and is therefore omitted. 


For j G { 1 ,..., m}, define (w* , u 2 ) to be the solution of the equations 


(I [P 2j < u 2 \ + \S?{ Ul ) n Sj j) (u 2 )|)ai 
S 2 (u 2 ) 

(I [Py < u 2 \ + IggVi) n S^(u 2 )\)q 2 

1 + ^(Mi) 


(C.10) 


Note that if P\j < then (u*,u 2 ) = « 2 ^), and |<Si(u*) fl<S 2 (u 2 )| = I [P 2 j < 

u 2 ] +1 S± ] {ul ij) ) (u * 2 U) ) |. In addition, both (u^ , u*^) andI[P 2 j < w 2 ]+|5^(m*^)D 
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So'[U 


are independent of P\ r Therefore, 


ET= l(l - Hv)l [Py < tij, P 2j < M*] (J) 

Er=ii[^<<^<^] 1 1 ’ 2 

vE-i rr \p - ?i 1 > P 'E < m!|] I p0‘) p 


/ I[Plj < Ml, P 2 j < = % 


J=1 


O') 



a, = it. 


■0)i 


, U/ 2 


V I[P 2J < u 2 u) ] + n<s^(u 2 


U)r„ *0h 


’0)^,*0)'| 


Pi (i) ,P 2 


I[Pq < P 2 j < u 


,*0)i 


I[P 2j < u* (j) ] + ^(m^) fl5H« 


O'V *0b 


:0)r„*0T 


P 


l‘\P2 


Pr(Pi, < P U, )I[^ <« 


,*0)i 


i[p 2 , < «;«] + ISPK 0 ') ns«'(4 






Mi 


’![% < 


U n 


< < «P'] + |S!"K W ) ns 2 u '(«; 


— Q^l 




i=l 


^ 2 (m 2 0) ) 


(C.ll) 

(C.12) 

(C.13) 


The equality in (C.ll) follows from the fact that Pp and (P{ J \ P 2 ) are independent 
for any j with Hp = 0. The inequality in (C.12) follows from the fact that P\j has 
a distribution at least stocha stical ly as large as the uniform distribution for j with 
H\j = 0, and the equality in ( C.13[ ) follows from the definition of (m*^,m 2 ^). 

Let jo be an arbitrary fixed index in {j : Hp = 0}. It thus follows that 


E 


E 


~ Hp)l[Pp < u\, P 2J < u* 2 \ 
pV^/1 _ TT \ ( I[Plj < Ml, P 2 j < U* 2 \ 

U { iJ Ie7=iI[Pi,<m I,^-<m* 2 


P[ j \P2 


m 


< cti 5^(1 - H^E 
3 =1 


V S 2 { 


= a\E <| 


fE”li(i-Pi,)i[P 2 ,< 


S 2 (u. 


*Uo)' 



< Oil- 


(C.14) 


The equality in (C.14) follows from the fact that the distribution of yP\ ]0 \ P 2 j is the 

same as that of [Pi\ P 2 ^ for every j with Hp = 0, since the p-values are assumed 
to be independent and exchangeable under the null. 


38 





















